Towards Scalable Meta-Learning of near-optimal Interpretable Models via Synthetic Model Generations

Myint, Kyaw Hpone, Wu, Zhe, Day, Alexandre G. R., Iyengar, Giri

arXiv.org Machine Learning

Decision trees are widely used in high-stakes fields like finance and healthcare due to their interpretability. This work introduces an efficient, scalable method for generating synthetic pre-training data to enable meta-learning of decision trees. Our approach samples near-optimal decision trees synthetically, creating large-scale, realistic datasets. Using the MetaTree transformer architecture, we demonstrate that this method achieves performance comparable to pre-training on real-world data or with computationally expensive optimal decision trees. This strategy significantly reduces computational costs, enhances data generation flexibility, and paves the way for scalable and efficient meta-learning of interpretable decision tree models.
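The data-generation idea in the abstract — sample a decision tree at random, then label synthetic points by routing them through it, so the sampled tree is by construction a zero-error (hence optimal) tree for the dataset it produced — can be illustrated in a few lines. This is a toy sketch, not the authors' actual generator; the function names and the uniform feature distribution are assumptions:

```python
import random

def sample_tree(depth, n_features, rng):
    """Recursively sample a random axis-aligned decision tree.

    Internal nodes are (feature, threshold, left, right); leaves are labels.
    """
    if depth == 0:
        return rng.randint(0, 1)  # leaf: binary class label
    feat = rng.randrange(n_features)
    thresh = rng.random()
    return (feat, thresh,
            sample_tree(depth - 1, n_features, rng),
            sample_tree(depth - 1, n_features, rng))

def predict(tree, x):
    """Route a point through the tree to its leaf label."""
    while isinstance(tree, tuple):
        feat, thresh, left, right = tree
        tree = left if x[feat] <= thresh else right
    return tree

def synth_dataset(n_samples, n_features, depth, seed=0):
    """Label uniform random points with a freshly sampled tree; that tree
    classifies its own dataset perfectly, giving a known-optimal target."""
    rng = random.Random(seed)
    tree = sample_tree(depth, n_features, rng)
    X = [[rng.random() for _ in range(n_features)] for _ in range(n_samples)]
    y = [predict(tree, x) for x in X]
    return X, y, tree
```

Repeating this sampling at scale yields the kind of large synthetic corpus of (dataset, near-optimal tree) pairs that a meta-learner such as MetaTree could be pre-trained on.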


f9b9f0fef2274a6b7009b5d52f44a3b6-AuthorFeedback.pdf

Neural Information Processing Systems

The fundamental difference is between "many to one". Figure 1 shows example generations from the model trained on ChEMBL. We actually did run an RL baseline (Eq. We discuss the work of Norouzi et al. [2016] in detail in Section 3.3. They also do not use the entropy term in training, only to motivate derivations.


GneissWeb: Preparing High Quality Data for LLMs at Scale

Gohari, Hajar Emami, Kadhe, Swanand Ravindra, Shah, Syed Yousaf, Adam, Constantin, Adebayo, Abdulhamid, Adusumilli, Praneet, Ahmed, Farhan, Angel, Nathalie Baracaldo, Borse, Santosh, Chang, Yuan-Chi, Dang, Xuan-Hong, Desai, Nirmit, Eres, Ravital, Iwamoto, Ran, Karve, Alexei, Koyfman, Yan, Lee, Wei-Han, Liu, Changchang, Lublinsky, Boris, Ohko, Takuyo, Pesce, Pablo, Touma, Maroun, Wang, Shiqiang, Witherspoon, Shalisha, Woisetschlager, Herbert, Wood, David, Wu, Kun-Lung, Yoshida, Issei, Zawad, Syed, Zerfos, Petros, Zhou, Yi, Bhattacharjee, Bishwaranjan

arXiv.org Artificial Intelligence

Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. The GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained on the GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained on GneissWeb still retain a 1.75 percentage point advantage over those trained on FineWeb-V1.1.0.
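An "ensemble of quality filters" of the kind the recipe describes can be sketched as a simple vote over cheap per-document heuristics: a document survives only if enough filters accept it. The three heuristics below are illustrative stand-ins, not the actual GneissWeb filters, and all thresholds are assumptions:

```python
def filter_ensemble(doc, filters, min_votes):
    """A document survives if at least min_votes quality filters accept it."""
    return sum(1 for f in filters if f(doc)) >= min_votes

# Illustrative heuristic filters (not the actual GneissWeb ensemble).
def has_enough_words(doc, lo=50):
    """Very short documents are usually boilerplate or navigation."""
    return len(doc.split()) >= lo

def low_symbol_ratio(doc, hi=0.1):
    """Heavy symbol content often indicates markup or spam."""
    symbols = sum(1 for c in doc if not (c.isalnum() or c.isspace()))
    return symbols / max(len(doc), 1) <= hi

def ends_in_punctuation(doc):
    """Natural prose lines tend to terminate with sentence punctuation."""
    lines = [l for l in doc.splitlines() if l.strip()]
    good = sum(1 for l in lines if l.rstrip().endswith(('.', '!', '?', '"')))
    return good / max(len(lines), 1) >= 0.5
```

Tuning `min_votes` is one way to trade quality against quantity: a stricter vote keeps fewer, cleaner tokens, which is the trade-off the abstract highlights.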


Quality-Diversity through AI Feedback

Bradley, Herbie, Dai, Andrew, Teufel, Hannah, Zhang, Jenny, Oostermeijer, Koen, Bellagente, Marco, Clune, Jeff, Stanley, Kenneth, Schott, Grégory, Lehman, Joel

arXiv.org Artificial Intelligence

In many text-generation problems, users may prefer not only a single response, but a diverse range of high-quality outputs from which to choose. Quality-diversity (QD) search algorithms aim at such outcomes, by continually improving and diversifying a population of candidates. However, the applicability of QD to qualitative domains, like creative writing, has been limited by the difficulty of algorithmically specifying measures of quality and diversity. Interestingly, recent developments in language models (LMs) have enabled guiding search through AI feedback, wherein LMs are prompted in natural language to evaluate qualitative aspects of text. Leveraging this development, we introduce Quality-Diversity through AI Feedback (QDAIF), wherein an evolutionary algorithm applies LMs to both generate variation and evaluate the quality and diversity of candidate text. When assessed on creative writing domains, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities. In this way, QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve, which are among the core skills underlying human society's capacity for innovation.
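The loop the abstract describes — an evolutionary algorithm in which an LM both generates variation and evaluates quality and diversity — can be sketched as a minimal MAP-Elites-style archive that keeps the best candidate per diversity niche. The `quality` and `niche` functions below are plain callables standing in for LM feedback, and all names are illustrative:

```python
import random

def qd_search(seed_text, mutate, quality, niche, steps, rng):
    """Minimal MAP-Elites loop: keep the best candidate per diversity niche.

    mutate(text, rng) -> new text; quality(text) -> float; niche(text) -> key.
    In QDAIF both quality and niche come from prompting an LM; here they
    are ordinary functions standing in for that AI feedback.
    """
    archive = {niche(seed_text): (quality(seed_text), seed_text)}
    for _ in range(steps):
        # Pick a random elite from the archive and mutate it.
        _, parent = rng.choice(list(archive.values()))
        child = mutate(parent, rng)
        key, q = niche(child), quality(child)
        # Insert if the niche is empty or the child beats the incumbent.
        if key not in archive or q > archive[key][0]:
            archive[key] = (q, child)
    return archive
```

The archive at the end is exactly the "diverse range of high-quality outputs" the abstract refers to: one strong candidate per region of the specified search space.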


On the Exploitability of Reinforcement Learning with Human Feedback for Large Language Models

Wang, Jiongxiao, Wu, Junlin, Chen, Muhao, Vorobeychik, Yevgeniy, Xiao, Chaowei

arXiv.org Artificial Intelligence

Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences, playing an important role in LLM alignment. Despite its advantages, RLHF relies on human annotators to rank the text, which can introduce potential security vulnerabilities if any adversarial annotator (i.e., attacker) manipulates the ranking score by up-ranking malicious text to steer the LLM adversarially. To assess the red-teaming of RLHF against human preference data poisoning, we propose RankPoison, a poisoning attack that flips the preference rank of selected candidate pairs to induce certain malicious behaviors (e.g., generating longer sequences, which can increase the computational cost). With the poisoned dataset generated by RankPoison, we can attack LLMs so that they generate longer outputs without hurting the original safety alignment performance. Moreover, applying RankPoison, we also successfully implement a backdoor attack where LLMs generate longer answers to questions containing the trigger word. Our findings highlight critical security challenges in RLHF, underscoring the necessity for more robust alignment methods for LLMs.
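A toy version of the attack surface: given preference pairs (chosen, rejected), an adversarial annotator flips the label on a budget of pairs where the rejected answer is longer, biasing the reward model toward length. This is a simplified sketch of rank flipping, not RankPoison's actual candidate-selection objective, and the length-gap heuristic is an assumption:

```python
def poison_preferences(pairs, budget):
    """Flip the preference label on up to `budget` (chosen, rejected) pairs
    where the rejected response is longer than the chosen one, prioritizing
    the largest length gap -- a toy stand-in for selecting flips that push
    the reward model toward longer outputs."""
    # Candidate flips: pairs whose rejected answer is longer than the chosen one.
    candidates = sorted(
        (i for i, (c, r) in enumerate(pairs) if len(r) > len(c)),
        key=lambda i: len(pairs[i][1]) - len(pairs[i][0]),
        reverse=True,
    )
    flipped = set(candidates[:budget])
    # Swap chosen/rejected on the flipped pairs; leave the rest untouched.
    return [(r, c) if i in flipped else (c, r)
            for i, (c, r) in enumerate(pairs)]
```

Keeping the budget small is what makes such poisoning hard to notice: most of the dataset remains clean, so overall alignment metrics barely move while the targeted behavior shifts.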


Cabrita: closing the gap for foreign languages

Larcher, Celio, Piau, Marcos, Finardi, Paulo, Gengo, Pedro, Esposito, Piero, Caridá, Vinicius

arXiv.org Artificial Intelligence

The strategy of training the model from scratch in a specific language or domain serves two essential purposes: i) enhancing performance in the particular linguistic or domain context, and ii) ensuring effective tokenization. The main limitation inherent to this approach lies in the associated cost, which can reach six- to seven-digit dollar amounts, depending on the model size and the number of parameters involved. The main solution to the cost challenge is to rely on available pre-trained models, which, despite recent advancements such as the LLaMA and LLaMA-2 models, still demonstrate inefficiency for certain specific domain problems, or prove ineffective in scenarios involving conversational memory resources, given the large number of tokens required to represent the text. To overcome this issue, we present a methodology named Cabrita, which, as our research demonstrates, successfully addresses the performance and efficient tokenization problems, all at an affordable cost. We believe that this methodology can be applied to any transformer-like architecture model. To validate the study, we conducted continuous pre-training exclusively on Portuguese text with a 3-billion-parameter model known as OpenLLaMA, resulting in a model named openCabrita 3B. The openCabrita 3B also features a new tokenizer that significantly reduces the number of tokens required to represent the text. In our assessment, for few-shot learning tasks, the 3B model achieved results similar to those of a traditional continuous pre-training approach, as well as to 7B English pre-trained models.
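The tokenizer efficiency targeted here is commonly measured as fertility (average tokens per word); driving fertility down on Portuguese text is what a language-specific tokenizer like openCabrita's is meant to achieve. A minimal sketch of the metric with two extreme toy tokenizers (both names illustrative, not the actual tokenizers compared in the paper):

```python
def fertility(tokenize, texts):
    """Average number of tokens per whitespace-separated word: the metric a
    language-specific tokenizer is meant to drive down (toward 1.0)."""
    toks = sum(len(tokenize(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return toks / words

# Two extremes bounding any real subword tokenizer:
char_tok = lambda t: list(t.replace(" ", ""))  # worst case: one token per character
word_tok = lambda t: t.split()                 # best case: one token per word
```

A real subword tokenizer lands between these bounds; the closer to 1.0 on the target language, the fewer tokens a fixed context window spends representing the same text.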


Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Gururangan, Suchin, Card, Dallas, Dreier, Sarah K., Gade, Emily K., Wang, Leroy Z., Wang, Zeyu, Zettlemoyer, Luke, Smith, Noah A.

arXiv.org Artificial Intelligence

Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles -- written by students from across the country -- we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.
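For context, the GPT-3 quality filter examined here scores each web document with a classifier (trained with Wikipedia, books, and newswire as positives) and then keeps documents stochastically: a document survives if `np.random.pareto(alpha) > 1 - document_score`, with alpha = 9. A sketch of that acceptance rule using only the standard library (note that Python's `paretovariate` is shifted by 1 relative to NumPy's `pareto`, hence the subtraction):

```python
import random

def keep_document(score, alpha=9.0, rng=random):
    """GPT-3-style stochastic quality filtering: keep the document iff a
    Pareto draw exceeds 1 - score. High-scoring documents almost always
    survive, while low-scoring documents still slip through occasionally,
    preserving some diversity in the kept corpus."""
    return (rng.paretovariate(alpha) - 1) > (1 - score)
```

The abstract's critique applies upstream of this rule: whatever `score` the classifier assigns encodes whose language the anchor corpora treat as "high quality".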


Auto-Annotation Quality Prediction for Semi-Supervised Learning with Ensembles

Simon, Dror, Farber, Miriam, Goldenberg, Roman

arXiv.org Artificial Intelligence

Auto-annotation by ensemble of models is an efficient method of learning on unlabeled data. Wrong or inaccurate annotations generated by the ensemble may lead to performance degradation of the trained model. To deal with this problem we propose filtering the auto-labeled data using a trained model that predicts the quality of the annotation from the degree of consensus between ensemble models. Using semantic segmentation as an example, we show the advantage of the proposed auto-annotation filtering over training on data contaminated with inaccurate labels. Moreover, our experimental results show that in the case of semantic segmentation, the performance of a state-of-the-art model can be achieved by training it with only a fraction (30%) of the original manually labeled data set, and replacing the rest with the auto-annotated, quality-filtered labels.
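The consensus idea — judge an auto-annotation's quality by how strongly the ensemble members agree on it — can be sketched with mean pairwise IoU over binary segmentation masks as the agreement signal. The paper trains a model to predict quality from consensus; the fixed IoU threshold below is a simplified stand-in for that learned predictor:

```python
def iou(a, b):
    """Intersection-over-union of two flat binary masks (lists of 0/1)."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 1.0

def consensus_score(masks):
    """Mean pairwise IoU across one sample's ensemble predictions."""
    pairs = [(i, j) for i in range(len(masks)) for j in range(i + 1, len(masks))]
    return sum(iou(masks[i], masks[j]) for i, j in pairs) / len(pairs)

def filter_auto_labels(samples, tau=0.8):
    """Keep only samples whose ensemble predictions agree strongly; each
    sample is a list of binary masks, one per ensemble member."""
    return [s for s in samples if consensus_score(s) >= tau]
```

Samples where the ensemble disagrees are exactly the ones most likely to carry wrong labels, so dropping them trades a little data quantity for cleaner training signal.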